Salaries Exploratory Data Analysis
Manuela da Cruz Chadreque
10 Dezember 2020
Welcome to a quick project to practice your pandas skills! We will be using the SF Salaries Dataset from Kaggle!
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.api.types import CategoricalDtype
df= pd.read_csv('Salaries.csv')
# check the first rows
df.head()
Analysis:
We continue our inspection to the data .info() method to find out how many valid entries per variable/column, this is important also to spot missing values
df.shape
df.info()
# Print the value counts for 'Borough'
print(df['Agency'].unique())
Analysis:
#Drop Notes and Status since these columns does not have information
df.drop(['Id','Notes', 'Status','Agency'], axis=1, inplace=True)
df.columns
dic={'EmployeeName':'Employee', 'JobTitle':'Job'}
df.rename(columns=dic, inplace=True)
df.head()
cat_type = CategoricalDtype(categories=[2011, 2012, 2013, 2014], ordered=True)
df['Year'] = df['Year'].astype(cat_type)
df['Year'].dtype
#Missing values per column
print(df.isnull().sum())
print('__________________ ')
print(df.isna().sum()/len(df)*100)
#Visualizing missing values
sns.heatmap(df.isnull(), cbar=False)
Analysis:
df.duplicated().sum()
What is the average Base Pay ?
df['BasePay'].mean()
What was the average (mean) BasePay of all employees per year? (2011-2014) ?
BasePayPerYear=df.groupby('Year').mean()['BasePay']
print(BasePayPerYear)
BasePayPerYear.plot(kind='line')
plt.show()
Analysis:
What is the highest amount of OvertimePay in the dataset ?
df[df['OvertimePay']==df['OvertimePay'].max()]
Analysis:
What is the job title of JOSEPH DRISCOLL ?
Note: Use all caps, otherwise you may get an answer that doesn't match up (there is also a lowercase Joseph Driscoll).
df[df['Employee']=='JOSEPH DRISCOLL']['Job']
How much does JOSEPH DRISCOLL make (including benefits)?
df[df['Employee']=='JOSEPH DRISCOLL']['TotalPayBenefits']
What is the name of highest paid person (including benefits)?
#sal.loc[sal['TotalPayBenefits'].argmax()]
df[df['TotalPayBenefits']==df['TotalPayBenefits'].max()]
What is the name of lowest paid person (including benefits)?
df[df['TotalPayBenefits']==df['TotalPayBenefits'].min()]
How many unique job titles are there?
sal['JobTitle'].nunique()
What are the top 5 most common jobs?
Top5=df['Job'].value_counts().head(5)
print(Top5)
Top5.plot(kind='barh')
How many Job Titles were represented by only one person in 2013? (e.g. Job Titles with only one occurence in 2013?)
sum(df[df['Year']==2013]['Job'].value_counts()==1)
How many people have the word Chief in their job title?
def chief_string(title):
if 'chief' in title.lower():
return True
else:
return False
sum(df['Job'].apply(lambda x: chief_string(x)))
Is there a correlation between length of the Job Title string and Salary?
df['title_len'] = df['Job'].apply(len)
df[['title_len','TotalPayBenefits']].corr()
Analysis:
df.head()
s=['BasePay','OvertimePay','OtherPay','Benefits']
sns.heatmap(np.round(df[s].corr(),2), annot=True,cmap="YlGnBu")
Analysis: